Abstract

This paper introduces the architecture and initial algorithms for Pixel-planes 5, a heterogeneous multi-computer
designed both for high-speed polygon and sphere rendering (1M Phong-shaded triangles/second) and for supporting
algorithm and application research in interactive 3D graphics. Techniques are described for volume rendering at
multiple frames per second, font generation directly from conic spline descriptions, and rapid calculation of radiosity
form factors. The hardware consists of up to 32 math-oriented processors, up to 16 rendering units, and a
conventional 1280x1024-pixel frame buffer, interconnected by a 5 gigabit ring network. Each rendering unit
consists of a 128x128-pixel array of processors-with-memory with parallel quadratic expression evaluation.
Implemented on fast custom CMOS chips, this array has 208 bits/pixel on-chip and is connected to a video RAM
memory system that provides 4,096 bits of off-chip memory per pixel. Rendering units can be independently reassigned to
any  part  of  the  screen  or  to  non-screen-oriented  computation.  A  message-passing  operating  system  encourages 
algorithms  to  mix  and  match  capabilities  of  the  massively  parallel  rendering  units  with  those  of  the  math-oriented 
processors. As of January 1989, both hardware and software are still under construction, with initial system
operation scheduled for summer 1989.

1.  Introduction 

Many  computer  applications  seek  to  create  an  illusion  of  interaction  with  a  virtual  world.  Vehicle  simulation, 
geometric  modeling  and  scientific  visualization,  for  example,  all  require  rapid  display  of  computer-generated  imagery 
that  changes  dynamically  according  to  the  user's  wishes.  Much  progress  has  been  made  in  developing  high-speed 
rendering  hardware  over  the  past  several  years,  but  even  the  current  generation  of  graphics  systems  can  render  only 
modest  scenes  at  interactive  rates. 

For many years our research goal has been the pursuit of truly interactive graphics systems. To achieve the
necessary rendering speeds and to provide a platform for real-time algorithm research, we have developed the parallel
image generation architecture called Pixel-planes [Fuchs 81, Fuchs 82, Poulton 85]. We briefly describe the basic
ideas in the architecture:

* This work supported by the Defense Advanced Research Projects Agency, DARPA ISTO Order No. 6090, the
National Science Foundation, Grant No. DCI-8601152, and the Office of Naval Research, Contract No. N00014-86-K-0680.

Each pixel is provided with a minimal, though general, processor, together with local memory to store pixel color,
z-depth, and other pixel information. Each processor receives a distinct value of a linear expression in screen-space,
Ax + By + C, where A,B,C are data inputs and x,y is the pixel address in screen-space. These expressions are
generated in a parallel linear expression evaluator, composed of a binary tree of tiny multiply-accumulator nodes. A
custom VLSI chip contains pixel memory, together with the relatively compact pixel processors and the linear
expression evaluator, both implemented in bit-serial circuitry. An array of these chips forms a "smart" frame buffer,
a 2D computing surface that receives descriptions of graphics primitives in the form of coefficients (A,B,C) with
instructions and locally performs all pixel-level rendering computations. Since instructions, memory addresses, and
A,B,C coefficients are broadcast to all processors, the smart frame buffer forms a Single-Instruction-Multiple-Datastream
computer, and has a very simple connection topology. Instructions (including memory addresses and
A,B,Cs) are generated in a conventional graphics transformation engine, with the relatively minor additional task of
converting screen-space polygon vertices and colors into the form of linear expressions and instructions.
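
The linear-expression scheme can be illustrated with a short software sketch. The Python below is only an illustration under stated assumptions, not the bit-serial hardware: the function names (`edge_coeffs`, `broadcast`, `inside_triangle`) and the 8x8 grid are hypothetical. Each triangle edge is converted to coefficients (A,B,C), the expression Ax + By + C is evaluated at every pixel, and a pixel stays enabled only if all three edge values are non-negative.

```python
def edge_coeffs(p, q):
    """Return (A, B, C) with A*x + B*y + C >= 0 on the left of the directed edge p -> q."""
    (x0, y0), (x1, y1) = p, q
    return (y0 - y1, x1 - x0, x0 * y1 - x1 * y0)


def broadcast(coeffs, width, height):
    """Evaluate A*x + B*y + C at every pixel address (x, y), as the evaluator tree would."""
    A, B, C = coeffs
    return [[A * x + B * y + C for x in range(width)] for y in range(height)]


def inside_triangle(vertices, width, height):
    """Per-pixel enable flags for a counterclockwise triangle (illustrative only)."""
    enable = [[True] * width for _ in range(height)]
    for i in range(3):
        values = broadcast(edge_coeffs(vertices[i], vertices[(i + 1) % 3]), width, height)
        for y in range(height):
            for x in range(width):
                # Every pixel applies the same test to its own value (SIMD style).
                enable[y][x] = enable[y][x] and values[y][x] >= 0
    return enable


mask = inside_triangle([(0, 0), (6, 0), (0, 6)], 8, 8)
covered = sum(row.count(True) for row in mask)
print(covered)
```

In this sketch the cost per primitive is a fixed number of broadcasts, independent of how many pixels the primitive covers, which is the property the architecture exploits.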

In 1986 we completed a full-scale prototype Pixel-planes system, Pixel-planes 4 (Pxpl4) [Poulton 87, Eyles 88],
which renders 39,000 Gouraud-shaded, z-buffered polygons per second (13,000 smooth-shaded interpenetrating
spheres/second, 11,000 shadowed polygons/second) on a 512x512 pixel full-color display. While this system was a
successful research vehicle and is extremely useful in our department's computer graphics laboratory, it is too large
and expensive to be practical outside of a research setting. Its main limitations are:

•  large amount of hardware, often utilized poorly (particularly when rendering small
primitives)

•  hard limit on the memory available at each pixel (72 bits)

•  no access to pixel data by the transformation unit or host computer

•  insufficient front-end computation power

This paper describes its successor, Pixel-planes 5 (Pxpl5), which we expect to have running by mid-1989. Pxpl5
uses screen subdivision and multiple small rendering units in a modular, expandable architecture to address the
problem of processor utilization. A full-size system can render in excess of one million Phong-shaded triangles per
second. Sufficient "front end" power for this level of performance is provided by a MIMD array of general-purpose
math-oriented processors. The machine's multiple processors communicate over a high-speed network. Its
organization is sufficiently general that it can efficiently render curved surfaces, volume-defined data and CSG-defined
objects. In addition it can rapidly perform various image-processing algorithms. Pxpl5's rendering units each are
5x faster than Pxpl4 and contain more memory per pixel, distributed in a memory hierarchy: 208 bits of fast local
storage on its processor-enhanced memory chips, 4K bits of memory per pixel processor in a VRAM "backing
store", and a separate frame buffer that refreshes normal and stereo images on a 1280x1024 72Hz display.

2.  Perspective 

Raster graphics systems generally contain two distinct parts: a graphics transformation engine that transforms and
lights the geometric description of a scene in accordance with the user's viewpoint and a renderer that paints the
transformed scene onto a screen.


Designs for fast transformation units have often cast the series of discrete steps in the transformation process onto a
pipeline of processing elements, each of which does one of the steps [Clark 82]. As performance requirements
increase, however, simple pipelines begin to experience communication bottlenecks, so designers have turned to
multiple pipelines [Runyon 87] or have spread the work at some stages of the pipe across multiple processors
[Akeley 88]. Vector organizations offer a simple and effective way to harness the power of multiple processors, and
have been used in the fastest current graphics workstations [Apgar 88, Diede 88]. Wide vector organizations may
have difficulty with data structures of arbitrary size, such as those that implement the PHIGS+ standard, so at least
one commercial offering divides the work across multiple processors operating in MIMD fashion [Torberg 87].

The rendering problem has generally been much more difficult to solve because it requires, in principle,
computations for every pixel of every primitive in a scene. To achieve interactive speeds on workstation-class
machines, parallel rendering engines have become the rule. These designs must all deal with the memory bandwidth
bottleneck at a raster system's frame buffer. Three basic strategies for solving this problem are:

Rendering Pipelines. The rendering problem can also be pipelined over multiple processors. The Hewlett-Packard
SRX graphics system [Swanson 86], for example, uses a pipeline of processors implemented in custom
VLSI that simultaneously perform 6-axis interpolations for visibility and shading, operating on data in a pixel cache.

The  frame  buffer  bandwidth  bottleneck  can  be  ameliorated  by  writing  to  the  frame  buffer  only  the  final  colors  of  the 
visible  pixels.  This  can  only  be  achieved  if  all  the  primitives  that  may  affect  a  pixel  are  known  and  considered 
before  that  pixel  is  written.  Sorting  primitives  by  screen  position  minimizes  the  number  that  have  to  be  considered 
for any one pixel. Sorting first by Y, then by X achieves a scan-line order that has been popular since the late
1960's and is the basis for several types of real-time systems [Watkins 70]. The basic strategy has been updated by
several groups recently. The SuperBuffer design [Gharachorloo 85] contained a processor for every pixel on a
scan-line. Data for primitives active on a scan-line pass by this array, and visible pixel colors are emitted at video rates;
no separate frame buffer is required. This work continues at IBM/TJW on a system called SAGE [Gharachorloo 88].
Researchers at Schlumberger [Deering 88] recently proposed a system in which visibility and Phong-shading
processors in a pipeline are assigned to the objects to be rendered on the current scan line. The latter two projects
promise future commercial offerings that can render on the order of 1M triangles per second with remarkably little hardware,
though  designs  for  the  front  ends  of  these  systems  have  yet  to  be  published.  These  machines  have  each  cast  one 
particular  rendering  algorithm  into  hardware,  enabling  a  lower-cost  solution  but  one  not  intended  for  internal 
programming  by  users.  New  algorithms  cannot  easily  be  mapped  onto  hardware  for  scan-line  ordered  pipelines. 
Finally,  a  difficulty  with  these  designs  is  ensuring  graceful  performance  degradation  for  scenes  with  exceptional 
numbers  of  primitives  crossing  a  given  scan-line. 

Interlaced Processors. As first suggested a decade ago [Fuchs 77, Fuchs 79, Clark 80], the frame buffer
memory can be divided into groups of memory chips, each with its own rendering processor, in an interlaced fashion
(each processor-with-memory handles every nth pixel on every mth row). The rendering task is distributed evenly
across the multiple processors, so the effective bandwidth into the frame buffer increases by a factor of m·n. This
idea is the basis of several of the most effective current raster graphics systems [Akeley 88, Apgar 88]. Some of
these systems, however, are again becoming limited by the bandwidth of commercial DRAMs [Whitton 84]. With
increasing numbers of processors operating in SIMD fashion, processor utilization begins to suffer because fewer
processors are able to operate on visible pixels, the "write efficiency" problem discussed in [Deering 88]. Raising
the performance of interlaced processors by an order of magnitude will probably require more complex organizations
or new memory devices.

Processor-Enhanced Memories. Much higher memory bandwidth can be obtained by combining some
processing circuitry on the same chip with dense memory circuits. The most widely used example of a "smart"
memory is the Video RAM (VRAM), introduced by Texas Instruments. Its only enhancement is a second, serial-access
port into the frame buffer memory; nevertheless these parts have had a great impact on graphics system
design. The SLAM system, described some years ago in [Demetrescu 85], combines a 2-D frame buffer memory
with an on-chip parallel 1-D span computation unit; it appears to offer excellent performance for some 2-D
applications but requires external processing to divide incoming primitives into scan-line slices. Recently NEC
announced a commercial version of an enhanced VRAM that performs many common functions needed in 2-D
windowing systems. This approach has been the focus of our work since 1980; in the Pixel-planes architecture we
have attempted to remove the memory bottleneck by performing essentially all pixel-oriented rendering tasks within
the frame buffer memory system itself.

The  architecture  we  will  describe  below  employs  a  MIMD  array  of  processors  in  its  transformation  unit  and  seeks  to 
make  more  effective  use  of  the  processor-enhanced  memory  approach. 

3.  Project  Goals 

We  wanted  Pixel-planes  5  to  be  a  platform  for  research  in  graphics  algorithms,  applications  and  architectures,  and  a
testbed  for  refinements  that  would  enhance  the  cost  effectiveness  of  the  approach.  To  this  end,  we  adopted  the 
following  goals: 

•  Fast Polygon Rendering. Despite all the interest in higher-order primitives and rendering techniques, faster
polygon rendering is the most often expressed need for many applications: 3D medical imaging, scientific
visualization, 'virtual worlds' research. We therefore set a goal of rendering 1 million Z-buffered Phong-shaded
triangles per second, assuming the average triangle's area is 100 pixels and that it is embedded in a triangle strip.
We wanted to achieve this rate without using any special structures for rendering just triangles; we wanted a
system for much more than triangles.

•  Generality. For the system to be an effective base for algorithm development, it needed to have a simple,
general structure whose power was readily accessible to the algorithm developer programming in a high-level
language. We wanted it to have sufficient generality for rendering curved surfaces, volume data, and objects
described with Constructive Solid Geometry, for rendering scenes using the radiosity lighting model, and (we
hoped) for a variety of other 3D graphics tasks that we have not yet considered. It was essential that the system
support a PHIGS+-like environment for application programmers not interested in the system's low-level details.
Further, the hardware platform should be flexible to allow experiments in hardware architectures.

•  Packaging. A high-performance configuration that met our primary performance goals should fit within a
workstation cabinet with no unusual power requirements. We also wanted a system that could be modularly
built and flexibly configured to trade cost for performance. The system should drive a 1280x1024 display at
>60Hz, and be able to update full scene images at >20 frames/second.

4.  Parallel  Rendering  by  Screen-space  Subdivision 

We now describe the scheme we use in Pxpl5 to attain high levels of performance in a compact, modular,
expandable machine. Our previous work has depended on a single, large computing surface of SIMD parallel
processors operating on the entire screen space. In the new architecture, we instead have one or more small SIMD
engines, called Renderers, that operate on small, separate 128x128-pixel patches in a virtual pixel space. Virtual
patches can be assigned on the fly to any actual patch of the display screen. The system achieves considerable
speedup by simultaneously processing graphics primitives that fall entirely within different patches on the screen.

The  principal  cost  of  this  screen-space  subdivision  scheme  is  that  the  primitives  handled  in  the  transformation  engine 
must  be  sorted  into  "bins"  corresponding  to  each  patch-sized  region  of  the  screen  space.  Primitives  that  fall  into 
more  than  one  region  are  placed  into  the  bins  for  all  such  regions.  The  simplest  (though  expensive)  way  to  support 
these  bins  is  to  provide  additional  storage  in  the  transformation  engine  for  the  entire,  sorted  list  of  output  primitives. 
Once  transformed,  sorted,  and  stored,  a  new  scene  is  rendered  by  assigning  all  available  Renderers  to  patches  on  the 
screen  and  dispatching  to  these  Renderers  primitives  from  their  corresponding  bins.  When  a  Renderer  completes  a
patch,  it  can  discard  its  Z-buffer  and  all  other  pixel  values  besides  colors;  pixel  color  values  are  transferred  from  on- 
chip  pixel  memory  to  the  secondary  storage  system,  or  "backing  store",  described  below.  The  Renderer  is  then 
assigned  to  the  next  patch  to  be  processed.  This  process  is  illustrated  in  Figure  1  for  a  system  configured  with  four 
Renderers. 
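
The binning step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Pxpl5 implementation: primitives are represented only by integer screen-space bounding boxes (xmin, ymin, xmax, ymax), the name `bin_primitives` is hypothetical, and a bounding-box test is conservative (it may place a primitive in a bin it does not actually touch).

```python
PATCH = 128  # Renderer patch size in pixels, as given in the text


def bin_primitives(primitives, screen_w=1280, screen_h=1024):
    """Map each (column, row) patch index to the ids of primitives whose box overlaps it."""
    cols = (screen_w + PATCH - 1) // PATCH
    rows = (screen_h + PATCH - 1) // PATCH
    bins = {(c, r): [] for r in range(rows) for c in range(cols)}
    for prim_id, (xmin, ymin, xmax, ymax) in primitives.items():
        # Clamp the range of patches touched by the bounding box to the screen.
        c0, c1 = max(xmin // PATCH, 0), min(xmax // PATCH, cols - 1)
        r0, r1 = max(ymin // PATCH, 0), min(ymax // PATCH, rows - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                # A primitive spanning several regions goes into every such bin.
                bins[(c, r)].append(prim_id)
    return bins


prims = {"a": (10, 10, 20, 20), "b": (120, 120, 135, 135)}
bins = bin_primitives(prims)
print(len(bins), bins[(0, 0)], bins[(1, 1)])
```

Primitive "b" straddles a patch corner and so lands in four bins, which is the duplicated work that the choice of patch size trades against.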

The  general  idea  of  multiple  independent  groups  of  pixel  processors  operating  on  disjoint  parts  of  the  display  screen 
was  described  in  several  of  our  earlier  publications  as  "buffered"  Pixel-planes.  What  is  new  about  this 
implementation  is  the  idea  of  flexibly  mapping  small  virtual  pixel  spaces  onto  the  screen  space.  It  allows  useful 
systems  to  be  built  with  any  number  of  small  rendering  units,  permits  cost/performance  to  be  traded  nearly  linearly, 
and  can  render  into  a  window  of  arbitrary  size  with  only  linear  time  penalty. 
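
The assignment policy of this section, in which a Renderer that completes a patch is given the next patch to be processed, can be modeled as a simple greedy scheduler. The sketch below is an assumption-laden illustration: the per-patch costs are invented numbers, Renderers are numbered from 0, and the function name `schedule` is hypothetical.

```python
import heapq


def schedule(patch_costs, n_renderers):
    """Give each patch, in raster order, to the Renderer that becomes free first.

    Returns (finish_time, assignment), where assignment[i] is the Renderer for patch i.
    """
    free_at = [(0.0, r) for r in range(n_renderers)]
    heapq.heapify(free_at)
    assignment = []
    finish = 0.0
    for cost in patch_costs:
        t, r = heapq.heappop(free_at)      # earliest-available Renderer
        assignment.append(r)
        done = t + cost
        finish = max(finish, done)
        heapq.heappush(free_at, (done, r))
    return finish, assignment


# Hypothetical rendering times for the first six patches of a frame.
costs = [1.0, 3.0, 3.0, 2.0, 2.0, 1.0]
finish, who = schedule(costs, 4)
print(finish, who)
```

With these invented costs the first Renderer finishes first and takes the fifth patch, and the fourth Renderer finishes next and takes the sixth.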


Figure 1. Rendering process for a Pxpl5 system with 4 Renderers. 1280x1024 screen is divided into 80 128x128
patches. Patches are processed in raster order. Renderers 1-4 are assigned initially to the first four patches. Renderer
#1 completes first, and is assigned to the next available patch. Next Renderer 4 completes its first patch and is
assigned to the next available patch, and so forth.

The virtual pixel approach is supported in the Pxpl5 implementation by a memory hierarchy, whose elements are:
(1) some 200 bits of fast SRAM associated on-chip with each pixel processor; (2) a "backing store" built from
VRAMs, tightly linked to the custom logic-enhanced memory chips; (3) a conventional VRAM frame buffer. The
backing store consists of an array of VRAMs, each connected via its video port to one of our custom memory
chips; 1-Mbit VRAMs provide 4Kbits of storage per pixel. The backing store memory is available through the
VRAM random I/O port to the rest of the system, which can read and write pixel values in the conventional way. A
Renderer uses this memory to save and retrieve pixel values, effectively allowing "context switches" when the
Renderer ceases operations on one patch and moves to another. A typical context switch takes about 0.4 msec, the
time to render a hundred or so primitives, and can be fully overlapped with pixel processing.
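
The 4Kbit-per-pixel figure can be checked with a little arithmetic. The sketch below assumes 1-megabit VRAM parts (2^20 bits each), one VRAM per custom chip, and the 64 chips of 256 pixel processors per Renderer given in the Figure 3 caption; the constant names are ours.

```python
CHIPS_PER_RENDERER = 64          # from the Figure 3 caption
PIXELS_PER_CHIP = 2 * 128        # 2 columns of 128 pixel processors per chip
VRAM_BITS = 1 << 20              # assumed: one 1-megabit VRAM per custom chip

pixels_per_renderer = CHIPS_PER_RENDERER * PIXELS_PER_CHIP
backing_bits_per_pixel = CHIPS_PER_RENDERER * VRAM_BITS // pixels_per_renderer
print(pixels_per_renderer, backing_bits_per_pixel)
```

Under these assumptions the chips cover exactly one 128x128 patch and each pixel has 4,096 bits of backing store, consistent with the figures quoted in the text.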

In the simple multi-Renderer scheme described above, the backing store is used to store pixel color values for patches
of the screen as the Renderer completes them. When the entire image has been rendered, each of these regions is
transferred in a block to the (double-buffered) display memory in the Frame Buffer, from which the display is
refreshed.

5.  Architectural  Overview 

We now describe the architectural features of the Pxpl5 system as well as the motivation for various design
decisions. The major elements of the design are:

•  Graphics  Processors  (GPs),  floating  point  engines,  each  with  considerable  local  code  and  data  storage. 

•  Renderers,  each  a  small  SIMD  array  of  pixel  processors  with  its  own  controller. 

•  Frame  Buffer,  double-buffered,  built  from  conventional  Video  RAMs,  from  which  the  display  is  refreshed. 

•  Host  Interface,  which  supports  communications  to/from  a  UNIX  workstation. 

•  Ring  Network  to  interconnect  the  various  processors  in  a  flexible  way. 


We discuss these elements in more detail below.

Figure 2. Pxpl5 block diagram.


5.1  Ring  Network 

PxplS's  multi-processor  architecture,  motivated  by  the  desire  to  support  a  variety  of  graphics  tasks,  requires  a 
capable  communications  network.  Rather  than  build  several  specialized  communications  busses  to  support  different 
types  of  traffic  between  system  elements,  we  instead  provide  a  single,  flexible,  very  high  performance  network 
connecting  all  parts  of  the  system. 

At  rendering  rates  of  1M  primitives  per  second,  moving  object  descriptions  from  the  GPs  to  the  Renderers  requires
up  to  40  million  32-bit  words/second  (40  MW/sec),  even  for  relatively  simple  rendering  algorithms. 
Simultaneously,  pixel  values  must  be  moved  from  the  Renderers  to  the  Frame  Buffer  at  rates  up  to  40  MW/sec,  for 
real-time  interactive  applications.  At  the  suggestion  of  J.  William  Poduska  of  Stellar  Computer,  Inc.,  we  explored 
technology  and  protocols  for  fast  ring  networks,  and  eventually  settled  on  a  multi-channel  token  ring.  Ring 
networks  have  many  advantages  over  busses  in  high-speed  digital  systems.  They  require  only  point-to-point 
communication,  thus  reducing  signal  propagation  and  power  consumption  problems,  while  allowing  a  relatively 
simple  communication  protocol. 

Our network can support eight simultaneous messages, each at 20 MW/sec, for a total bandwidth of 160 MW/sec.
To avoid deadlock, each transmitting device gains exclusive access first to its intended receiver, then to a data
channel, before it can transmit its data packet. Each Ring Node is a circuit composed of commercial MSI bus-oriented
data parts and field-programmable controllers. (At the cost of an expensive development cycle, the Ring
Network could be reduced to one or a few ASICs.) The controllers operate at 20MHz, while data is moved at 40MHz
(to save wires). Each client processor in the system has one or more of these Nodes, and each Node provides to the
client a 20 MW/sec bidirectional port onto the Ring network.
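
The bandwidth figures in this section can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the text (eight channels, 20 MW/sec each, 32-bit words, two 40 MW/sec traffic streams, 1M primitives per second); the variable names are ours, and protocol overhead is ignored.

```python
CHANNELS = 8
WORDS_PER_SEC_PER_CHANNEL = 20_000_000
BITS_PER_WORD = 32

total_words_per_sec = CHANNELS * WORDS_PER_SEC_PER_CHANNEL
total_bits_per_sec = total_words_per_sec * BITS_PER_WORD

gp_to_renderer = 40_000_000             # object descriptions, words/sec
renderer_to_frame_buffer = 40_000_000   # pixel values, words/sec
peak_demand = gp_to_renderer + renderer_to_frame_buffer

# Words available per primitive at 1M primitives per second.
words_per_primitive = gp_to_renderer // 1_000_000

print(total_words_per_sec, total_bits_per_sec, peak_demand / total_words_per_sec, words_per_primitive)
```

The total of 5.12 gigabits per second matches the "5 gigabit" figure quoted in the abstract, and the two peak traffic streams together would use about half of the raw capacity.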

We have developed a low-level message-passing operating system for the ring devices called the Ring Operating
System (ROS). It provides device control routines as well as hardware independent communication. In addition, ROS
controls the loading and initialization of programs and data.

5.2  Graphics  Processors

The performance goals we have set require sustained computation rates in the "front end" of several hundred MFlops,
feasible today only in parallel or vector architectures. We elected to build a MIMD transformation unit; this
organization handles PHIGS+-like variable data structures better than would a vector unit, and supports the "bins"
needed for our screen subdivision multi-Renderer. Weitek "XL" processors were used primarily because of the
existence of mature compilers and assemblers.

Much  of  the  system's  complexity  is  hidden  by  ROS;  the  programming  model  is  therefore  relatively  simple.  Load 
sharing  is  accomplished  by  dividing  a  database  across  the  GPs,  generally  with  each  GP  running  the  same  code.
Since  the  GPs  are  programmable  in  the  C  language,  users  have  access  to  the  machine's  full  capability  without 
needing  to  write  microcode. 

5.3  Renderer

Section 4 describes the essentials of the Renderer design, whose block diagram is shown in Figure 3. It is based on
a logic-enhanced memory chip built using 1.6µ CMOS technology and operating at 40MHz bit-serial instruction
rates. In addition to 256 pixel processing elements, each with 208 bits of static memory, the chip contains a
quadratic expression evaluator (QEE) that produces the function Ax + By + C + Dx² + Exy + Fy² simultaneously at each
pixel [Goldfeather 86b]. Quadratic expressions, while not essential for polygon rendering, are very useful for
rendering curved surfaces and for computing a spherical radiosity lighting model (see Section 7.6).
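
To illustrate why quadratic expressions help with curved primitives, the following Python sketch evaluates Ax + By + C + Dx² + Exy + Fy² at every pixel of a 128x128 patch and uses its sign to find the pixels covered by a disc (the screen projection of a sphere). It is a software illustration only; the coefficient derivation and the names `circle_coeffs` and `evaluate_quadratic` are our assumptions, not the QEE's actual interface.

```python
def circle_coeffs(cx, cy, r):
    """Coefficients (A, B, C, D, E, F) of r^2 - (x - cx)^2 - (y - cy)^2, expanded."""
    return (2 * cx, 2 * cy, r * r - cx * cx - cy * cy, -1, 0, -1)


def evaluate_quadratic(coeffs, size=128):
    """Evaluate A*x + B*y + C + D*x^2 + E*x*y + F*y^2 at every pixel of a size x size patch."""
    A, B, C, D, E, F = coeffs
    return [[A * x + B * y + C + D * x * x + E * x * y + F * y * y
             for x in range(size)]
            for y in range(size)]


values = evaluate_quadratic(circle_coeffs(64, 64, 2))
# A pixel lies inside the disc exactly when the expression is non-negative.
inside = sum(v >= 0 for row in values for v in row)
print(values[64][64], inside)
```

A single broadcast of six coefficients thus classifies every pixel of the patch at once, where a linear evaluator would need the disc to be approximated by polygons.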


Figure 3. Block diagram of a Pxpl5 Renderer. Pixel processor array implemented in 64 custom chips, each with
2 columns of 128 pixel processors-with-memory and a quadratic expression evaluator.

A major design issue for the Renderer was choosing the size of the processor array. The effectiveness of the screen-space
subdivision scheme for parallel rendering is determined in part by the frequency with which primitives must be
processed in more than one region, and this in turn depends on the size of the Renderer's patch. On one hand,
economy of use of the fairly expensive custom chips of the processor array and the need to leverage performance by
dividing the rendering work across as many processors as possible argue for smaller Renderer patches. A large
Renderer patch, on the other hand, reduces the likelihood that primitives will need to be processed more than once.
We elected a 128x128 Renderer size; it is fairly efficient for small primitives, and its hardware conveniently fits on a
reasonable size printed circuit board.

5.4  Frame  Buffer  and  Host  Interface 

The Frame Buffer is built in a fairly conventional way using Video RAMs. It supports a 1280x1024-pixel, 72Hz
refresh-rate display, 24-bit true color and a color lookup table. Display modes include stereo (alternating frames) and
a hardware 2x zoom. The Frame Buffer is accessed through two Ring Nodes, to provide an aggregate bandwidth of
40 MW/sec into the buffer, allowing up to 24 Hz updates for full-size images. Pxpl5 is hosted by a Sun 4
workstation. Host communication is via programmed I/O, providing up to about 4 MBytes/sec of bandwidth.

5.5  Performance 

Since the transformation engine in Pxpl5 is based on the same processor used in Pxpl4, we estimate, based on the
earlier machine's performance, that a GP can process on the order of 30,000 Phong-shaded triangles per second; 32 GPs are
therefore required to meet our performance goal. A single Renderer has a raw performance of about 150,000
Phong-shaded triangles per second; actual performance is reduced somewhat by inefficiencies resulting from primitives that
must be processed in more than one patch. Simulations predict an actual performance of around 100,000
triangles/sec, so a configuration to meet the performance goals will require 8-10 Renderers.
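
The sizing argument above is a short calculation, reproduced here as a sketch. All rates are the approximate figures quoted in the text; the variable names are ours, and the results should be read as rough estimates rather than measurements.

```python
import math

TARGET = 1_000_000        # triangles/sec performance goal
GP_RATE = 30_000          # estimated triangles/sec per GP
RENDERER_RAW = 150_000    # raw triangles/sec per Renderer
RENDERER_SIM = 100_000    # simulated triangles/sec per Renderer

front_end_rate = 32 * GP_RATE                          # throughput of 32 GPs
renderers_at_raw = math.ceil(TARGET / RENDERER_RAW)    # optimistic bound
renderers_at_sim = math.ceil(TARGET / RENDERER_SIM)    # using simulated rate
efficiency = RENDERER_SIM / RENDERER_RAW               # cost of multi-patch primitives

print(front_end_rate, renderers_at_raw, renderers_at_sim, round(efficiency, 2))
```

Thirty-two GPs give roughly 0.96M triangles per second, and the Renderer count falls between the raw and simulated bounds, which brackets the 8-10 figure given above.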

6.  PPHIGS  Graphics  Library 

Pxpl5 may be programmed at various levels. We anticipate users ranging from application programmers, who
simply desire a fast rendering platform with a PHIGS+-style interface [van Dam 88], to algorithm prototypers, who
need access to the renderer's low-level pixel operations and may depart from the PHIGS+ paradigm. To meet these
disparate needs, several layers of support software are required. Program initialization and message passing between
processors are handled by the Ring Operating System (ROS). A local version of PHIGS+ (Pixel-planes PHIGS or
PPHIGS) provides a high-level interface for users desiring portable code. This section describes PPHIGS.

PPHIGS  makes  the  hardware  appear  to  the  "high-level"  graphics  programmer  very  much  like  any  other  graphics 
system;  the  programmer's  code  (running  on  the  host)  makes  calls  to  the  graphics  system  to  build  and  modify  a 
hierarchical  data  structure.  This  structure  is  traversed  by  the  PPHIGS  system  to  create  the  image  on  the  screen. 

6.1  Database  Distribution 

Since the applications programming library is based on PHIGS, it allows the programmer to create a display list that
is a directed acyclic graph of structures. These structures contain elements that are either graphics primitives,
state-changing commands, or calls to execute other structures. To take advantage of the multiple graphics processors in
Pxpl5, we must distribute the database structure graph across the graphics processors in a way that balances the
computational load, even in the presence of editing and changes in view. To achieve this we distribute each
structure's primitives across the GPs instead of placing an entire structure on one GP.

Because PHIGS is a stateful system where child structures inherit information from their parents, the state-changing
primitives as well as structure executions must be replicated on each GP. This replication should not be a problem,
since we expect the majority of structure elements to be graphics primitives and not state-changing ones. We have
devised other distribution schemes for applications that violate this assumption. The display list traversal and
rendering routines are not affected by the distribution scheme.
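
The distribution rule can be sketched as follows. This is a hypothetical illustration: the text does not say how primitives are dealt out, so the round-robin policy, the element representation as (kind, payload) pairs, and the name `distribute` are all assumptions made for the example.

```python
def distribute(structure, n_gps):
    """Split one structure's elements across GPs.

    Primitives are dealt out round-robin (an assumed policy); state-changing
    elements and structure executions are replicated on every GP so that each
    GP traverses with the correct inherited state.
    """
    per_gp = [[] for _ in range(n_gps)]
    next_gp = 0
    for kind, payload in structure:
        if kind == "primitive":
            per_gp[next_gp].append((kind, payload))
            next_gp = (next_gp + 1) % n_gps
        else:
            for elements in per_gp:
                elements.append((kind, payload))
    return per_gp


structure = [
    ("state", "color=red"),
    ("primitive", "t0"),
    ("primitive", "t1"),
    ("execute", "wheel"),
    ("primitive", "t2"),
]
gp0, gp1 = distribute(structure, 2)
print(gp0)
print(gp1)
```

Each GP ends up with every state change and structure call in its original order, but only its share of the primitives, so the replicated work stays small when primitives dominate.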

6.2  The  Rendering  Process 

The  rendering  process  is  controlled  by  a  designated  graphics  processor,  the  master  GP  or  MGP.  By  exchanging 
messages  with  other  GPs  and  sending  commands  to  other  modules  when  necessary,  the  MGP  synchronizes 
operations  throughout  the  system. 

Before discussing the steps in the rendering process, we first want to emphasize the distinction between pixel
operations that take place on a per primitive basis, such as Z comparison and storage, and those that can be deferred
until the end of all primitive processing or end-of-frame. Shading calculations from intermediate values stored at
the pixels, for instance, need only be performed once per pixel, rather than once per primitive (assuming there is
sufficient pixel storage to hold the intermediate values until end-of-frame). At that time, the final pixel colors for
every pixel in the 128 x 128 pixel renderer can be calculated from the stored values. This savings results in about an
80x speedup for Phong shading over doing the Phong final calculations after every primitive is processed. For more
expensive lighting and shading models (such as texture) this speedup is critical for making the algorithm practical.

The  major  steps  in  the  rendering  process  are: 

1. The application program running on the host edits the database using PPHIGS library routines and transmits
these changes to the GPs.

2. The application requests a new frame. The host sends this request to the MGP, which relays it to the other GPs.

3. The GPs interpret the database, generating renderer commands for each graphics primitive. These commands are
placed into the local bins corresponding to the screen regions where the primitive lies. Each GP has a bin for
every 128x128 pixel region in the window being rendered.

4. The GPs send bins containing commands to renderers. The renderers execute commands and compute
intermediate results.

5. The GP sending the final bin to a renderer also sends end-of-frame commands for the region. The renderers
execute these commands and compute final pixel values from the intermediate results.

6. The renderers send computed pixels to the frame buffer.

7. When all regions have been received, the frame buffer swaps banks and displays the newly computed frame.
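
Step 3 can be sketched as follows. This is only an illustration: it assumes each primitive is binned by its screen-space bounding box, which may place a primitive in a region it does not actually touch, and the names `regions_touched` and `bin_commands` are hypothetical.

```python
REGION = 128  # renderer footprint in pixels, from the text


def regions_touched(bbox, width, height):
    """Return (column, row) indices of the 128x128 regions a bounding box overlaps.

    bbox is (xmin, ymin, xmax, ymax) in inclusive pixel coordinates.
    """
    xmin, ymin, xmax, ymax = bbox
    xmin, ymin = max(xmin, 0), max(ymin, 0)
    xmax, ymax = min(xmax, width - 1), min(ymax, height - 1)
    if xmin > xmax or ymin > ymax:
        return []  # entirely off screen
    return [(cx, cy)
            for cy in range(ymin // REGION, ymax // REGION + 1)
            for cx in range(xmin // REGION, xmax // REGION + 1)]


def bin_commands(primitives, width=1280, height=1024):
    """Place each primitive's command in the bin of every region it may cover."""
    bins = {}
    for name, bbox in primitives:
        for region in regions_touched(bbox, width, height):
            bins.setdefault(region, []).append(name)
    return bins


bins = bin_commands([("tri0", (10, 10, 50, 50)),
                     ("tri1", (120, 100, 140, 130))])
```

A primitive that straddles a region boundary, like `tri1` here, is duplicated into each bin it touches, which is the duplication cost discussed in the conclusions.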

The  MGP  assigns  renderers  to  screen  regions  while  the  frame  is  being  rendered.  It  communicates  a  renderer 
assignment  to  the  GPs  by  sending  a  message  to  one  GP,  which  sends  its  associated  bin,  and  then  forwards  the 
message  to  the  next  GP,  which  does  the  same.  At  the  end,  the  message  is  sent  back  to  the  MGP,  indicating  that  all 
the  bins  have  been  processed.  This  method  ensures  that  at  most  one  GP  attempts  to  transmit  to  a  renderer  at  a  given 
time. This prevents blocked transmissions, which would slow throughput.

The steps of the rendering process can be overlapped in several ways; at maximum throughput, several frames may
be in progress at once. The MGP handles synchronization to keep the frames properly separated [Ellsworth 89].
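
The forwarding of a renderer assignment from GP to GP behaves like a token passed around the GPs. The sketch below simulates one such circulation; the function name, the data layout, and the assumption that the MGP also holds a bin of its own are illustrative only.

```python
def circulate_assignment(region, gp_bins, master=0):
    """Simulate one assignment message visiting every GP in turn.

    gp_bins[g] maps a region id to the commands GP g holds for that region.
    Returns the ordered list of (gp, command) transmissions to the renderer.
    """
    log = []
    num_gps = len(gp_bins)
    for step in range(num_gps):
        gp = (master + step) % num_gps
        # Only the GP currently holding the message transmits its bin.
        for command in gp_bins[gp].get(region, []):
            log.append((gp, command))
    # The message now returns to the MGP: all bins for the region were sent.
    return log


gp_bins = [{"r0": ["a"]}, {}, {"r0": ["b", "c"]}]
log = circulate_assignment("r0", gp_bins)
```

Because transmissions happen only while a GP holds the message, two GPs never contend for the same renderer.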

7.  Rendering  Algorithms 

We now discuss various rendering algorithms in turn. Some of these have been published before, in which case we
review their applicability to Pxpl5 and give performance estimates. We also report new techniques for efficiently
displaying  procedural  textures  and  conic  spline-defined  fonts,  for  calculating  radiosity  form-factors,  and  for  displaying 
volume-defined  images  at  interactive  rates. 

7.1  Phong  Shading 

Since Pxpl5 can evaluate quadratic expressions rapidly, we could implement Phong shading using Bishop and
Weimer's Fast Phong Shading technique [Bishop 86]. Unfortunately, this requires large amounts of computation by
the front-end processors to determine the quadratic coefficients. Since we estimate that the renderers will usually have
more idle capacity than the GPs, we have decided to use a scheme which pushes most of the computation to the
renderers and is closer to the original Phong formulation [Phong 73].

As polygons and other primitives are processed, the x, y, and z components of the surface normal are stored in all the
pixels where the primitive is visible. For polygons this is done by simple linear interpolation of each component.
When all the primitives for a region have been processed, the pixel-parallel end-of-frame operations are performed.
The normal vector is normalized by dividing by its length (the square root of the sum of its squared components),
which is computed using a Newton iteration. Once this is done the color for each pixel is computed using the
standard Phong lighting model.
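
As an illustration of the normalization step, the sketch below uses a Newton iteration for the reciprocal square root, which avoids a division. This is an assumed formulation; the text does not say whether the iteration targets the square root or its reciprocal. The fixed initial guess of 1 is also an assumption and is adequate only because interpolated unit normals have squared length between 0 and 1.

```python
def normalize(nx, ny, nz, iterations=6):
    """Scale (nx, ny, nz) to unit length using Newton's method for 1/sqrt(s).

    Assumes 0 < s <= 1, where s is the squared length of the input vector.
    """
    s = nx * nx + ny * ny + nz * nz   # squared length
    y = 1.0                           # initial guess for 1/sqrt(s)
    for _ in range(iterations):
        # Newton step for f(y) = 1/y^2 - s
        y = y * (1.5 - 0.5 * s * y * y)
    return nx * y, ny * y, nz * y


unit = normalize(0.6, 0.0, 0.6)
```

Each step uses only multiplies and a subtraction, and the number of correct digits roughly doubles per iteration.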

Simulation indicates that the end-of-frame computation for the Phong lighting model with a single light source
consumes around 23,000 renderer cycles or 0.57 milliseconds. With full screen resolution of 1024 by 1280 (80 regions
of 128 x 128 pixels) and a 16 renderer system, the total end-of-frame time is 0.57 msec x (80/16) or 2.85 msec per
frame. At 24 frames per second this is 6.8 percent of the rendering time.
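
The arithmetic behind these figures can be written out as follows, assuming the 40 MHz renderer clock quoted in Section 7.3 and a screen divided evenly into 128 x 128 regions:

```latex
\begin{align*}
t_{\text{region}} &= \frac{23{,}000\ \text{cycles}}{40\ \text{MHz}} \approx 0.57\ \text{msec} \\
N_{\text{regions}} &= \frac{1280}{128} \times \frac{1024}{128} = 80 \\
t_{\text{frame}} &= 0.57\ \text{msec} \times \frac{80}{16} = 2.85\ \text{msec} \\
\frac{t_{\text{frame}}}{(1/24)\ \text{sec}} &= \frac{2.85\ \text{msec}}{41.7\ \text{msec}} \approx 6.8\%
\end{align*}
```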

7.2  Spheres 

Pxpl5 can render spheres using the same algorithm as on Pxpl4 [Fuchs 85], but it is faster (taking advantage of
the QEE) and can generate higher-quality images (Phong shading with 24-bit color). Phong shading is achieved as
follows. The expressions for the coordinates of the surface normal for a sphere are:

$$n_x = \frac{1}{r}\,(x - a)$$
$$n_y = \frac{1}{r}\,(y - b)$$
$$n_z = \frac{1}{r}\,\sqrt{r^2 - (x - a)^2 - (y - b)^2}$$

The expression for nz can be approximated by a parabola:

$$n_z \approx \frac{1}{r^2}\,\bigl(r^2 - (x - a)^2 - (y - b)^2\bigr)$$

Then the normals are computed at each pixel by broadcasting two linear expressions and one quadratic expression.
Results from simulation indicate that this approximation produces satisfactory shading including the specular
highlights. Assuming one light source and 24 frames per second, we estimate the system performance to be 1.8M
spheres per second for 100 pixel area spheres and 900K spheres per second for 1600 pixel area spheres.
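
To get a feel for the approximation, the sketch below compares the exact z component of a sphere normal, sqrt(r^2 - d^2)/r, with the parabola (r^2 - d^2)/r^2, where d is the screen-space distance from the sphere center (a, b). The parabolic form is a reconstruction that matches the exact value at the center and at the silhouette; the function names are illustrative.

```python
import math


def sphere_nz_exact(x, y, a, b, r):
    """Exact z component of the unit normal of a sphere centered at (a, b)."""
    d2 = (x - a) ** 2 + (y - b) ** 2
    return math.sqrt(max(r * r - d2, 0.0)) / r


def sphere_nz_parabola(x, y, a, b, r):
    """Quadratic approximation: a single quadratic expression in x and y."""
    d2 = (x - a) ** 2 + (y - b) ** 2
    return (r * r - d2) / (r * r)


def max_error(r, samples=200):
    """Largest absolute difference along one radius of the sphere."""
    worst = 0.0
    for i in range(samples + 1):
        x = r * i / samples
        err = abs(sphere_nz_exact(x, 0.0, 0.0, 0.0, r)
                  - sphere_nz_parabola(x, 0.0, 0.0, 0.0, r))
        worst = max(worst, err)
    return worst
```

The two expressions agree at the center and on the silhouette, and the deviation peaks partway toward the edge, where the parabola underestimates nz.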

7.3  Shadows

Pxpl4 generates images with shadows very rapidly, nearly half as fast as images without shadows [Fuchs 85].
Unfortunately, this figure does not scale up by the usual 20x for Pxpl5, since the performance increase from the
screen space subdivision does not extend to shadow volumes, which frequently cross many screen regions. At worst,
every shadow volume edge could be processed in every region, increasing the display list size by as much as 80/1.4 =
57x for a 1280 x 1024 image. A nominal Pxpl5 configuration has 16 renderers, each running at 40MHz as
opposed to Pxpl4's 8MHz. In this worst case, the shadow algorithm might be expected to run at about the same speed
on Pxpl5 as on Pxpl4. Various optimizations do exist; for example, shadow boundary edges need not be processed in
regions lying between a polygon and the light source. We have not yet explored these options in depth. Because of
the problems mentioned above, we anticipate increasing use of the fast radiosity technique described in Section 7.6.
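
One way to read the worst-case estimate is sketched below. It assumes that 80 is the number of 128 x 128 regions on a 1280 x 1024 screen, that 1.4 is the average number of regions an ordinary primitive touches (the figure used in Section 7.5), and that raw throughput scales with renderer count and clock rate:

```latex
\begin{align*}
\text{raw renderer speedup} &= 16 \times \frac{40\ \text{MHz}}{8\ \text{MHz}} = 80 \\
\text{worst-case display list growth} &= \frac{80\ \text{regions}}{1.4\ \text{regions per primitive}} \approx 57 \\
\text{net worst-case speedup} &\approx \frac{80}{57} \approx 1.4
\end{align*}
```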

7.4  Texture  Mapping 

We have previously reported a technique to compute the u,v texture coordinates for polygons in perspective [Fuchs
85]. The speed of this technique is limited by the time to broadcast the individual texture values to the pixels. While
64 x 64 image textures run at interactive rates on Pxpl4 (see Figure 4), a more efficient method for Pxpl5 is to
calculate the texture values directly in each pixel. Broadcasting the texture values will be significantly faster on
Pxpl5 than on Pxpl4, since texture values can be stored in bins and only broadcast when needed for one or more
pixels of a region.


Figure 4. Mandrill mapped onto a plane and hoop on Pxpl4. Estimated rendering time on Pxpl5 is 31 msec.

Procedural Textures. We have begun to explore procedural textures, as shown by Perlin [Perlin 85] and Gardner
[Gardner 88], for use in Pxpl5. We have written a program for Pxpl4 that allows one to explore in real-time the
space of textures possible using Gardner's technique. This program and software written by Douglass Turner were
used to create the textures shown in Figure 5.


Figure 5. Procedural earth, water, sky, and fire textures with brick image texture (simulated). Estimated rendering
time on Pxpl5 is 5.5 milliseconds.

The two-dimensional Gardner spectral functions are calculated using quadratic approximations for the cosine
functions. This requires nine multiplies per term plus one multiply to combine the x and y directions. Different
textures for different pixels can be computed simultaneously. The images shown in the figure contain five terms.
On Pxpl5 they would require about 15,000 cycles or 360 microseconds using 10 bits of resolution. These
procedural methods can be anti-aliased by eliminating high frequency portions of the texture; terms whose
wavelength spans less than one pixel are simply not computed [Norton 82].
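
The idea of replacing cosines with quadratics can be sketched as below. This is a simplified stand-in, not Gardner's full formulation: the two-parabola approximation, the omission of phase shifts, and the names `cos_quadratic` and `texture` are all assumptions made for illustration.

```python
import math


def cos_quadratic(phase):
    """Approximate cos(2*pi*phase) with two parabolic arcs per period."""
    u = phase - math.floor(phase + 0.5)   # wrap phase into [-0.5, 0.5)
    if abs(u) <= 0.25:
        return 1.0 - 16.0 * u * u         # arc around the maximum
    v = abs(u) - 0.5
    return 16.0 * v * v - 1.0             # arc around the minimum


def texture(x, y, terms):
    """Product of a cosine sum in x and a cosine sum in y.

    terms is a list of (amplitude, frequency) pairs; one multiply combines
    the x and y directions.
    """
    sx = sum(c * cos_quadratic(f * x) for c, f in terms)
    sy = sum(c * cos_quadratic(f * y) for c, f in terms)
    return sx * sy


worst = max(abs(cos_quadratic(i / 1000) - math.cos(2 * math.pi * i / 1000))
            for i in range(1000))
```

With this particular approximation the deviation from a true cosine stays below 0.06, which is small compared with the texture amplitudes involved.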

Image-based Textures. We have explored both summed area tables [Crow 84] and mip-maps [Williams 83] for
anti-aliasing image textures. We feel that mip-maps will work best on Pxpl5. During rendering the mip-map
interpolation value can be linearly interpolated across the polygon. At end of frame, the mip-map is broadcast to each
pixel-processor, and each processor loads the pixel at its u,v coordinate along with neighboring values for

7.5  Fonts 


Herve Tardif has been developing methods for rapidly rendering fonts. Conic splines, as advocated by several
researchers [Pavlidis 83, Pratt 85], are particularly well suited for rendering by Pxpl5; with the QEE in the
processor-enhanced memories, Pxpl5 can directly scan convert the conic sections from which characters are defined.

Initially, a character is represented by a sequence of straight line segments and arcs of conics joined together in the
plane. As described by Pratt, each arc of a conic is in turn represented by three points M, N, P and a scalar S which
measures the departure of the conic from a parabola (Figure 6.a). Hence, a letter can be represented either by a
simple closed polygon or, for letters with holes, two or more polygons. The character is initially converted into the
difference between its unique convex hull and the discrepancy with that hull. (Holes are treated the same as other
discrepancies.) The process is repeated if the discrepancy region(s) are concave. This process amounts to building a
tree whose leaves are convex regions and nodes are set operators [Tor 84] (see Figure 6.b). A character is rendered
by traversing its corresponding tree, scan converting each convex region in turn.


Figure  6.  Conic  font  constructed  by  regions  bounded  by  lines  and  conic  sections. 

Consider a convex region obtained through this process. For edges corresponding to straight line segments the
coefficients for that line are sent to the QEE. For two consecutive edges MN and NP representing an arc of conic,
the coefficients of the straight line (MP) are first sent to the QEE. Then the quadratic coefficients for the conic
section are derived and sent to the QEE in order to scan convert the region enclosed between the line MP and the
conic section. This process is repeated until all convex regions have been processed. Figure 6.c shows the regions
that are successively scan converted for the letter P. It is possible for the two segments MN and NP that describe an
arc to be split into two different convex regions during the decomposition process. In that case, the edges MN and
NP are considered as simple segments of the character definition until all convex regions have been processed. Then,
the region enclosed between the conic and the segments MN and NP is either added or subtracted from the current
construction (see Figure 6.d). Since conic sections are invariant under projective maps, this technique can also be
applied to the rendering of planar characters embedded in a 3D environment.
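
The per-pixel test for the region between the chord MP and the conic arc can be sketched with one linear and one quadratic expression. The sketch assumes the arc is the conic through M and P that is tangent to MN at M and to NP at P, written in barycentric coordinates (a, b, c) of the triangle M, N, P as 4 w^2 a c - b^2 = 0. The weight `w` is a stand-in for the sharpness scalar (w = 1 gives a parabola); its exact relation to Pratt's S is not given here. In the code the third control point is named `q` to avoid clashing with the test point `p`.

```python
def barycentric(p, m, n, q):
    """Barycentric coordinates (a, b, c) of point p in triangle M, N, P."""
    (px, py), (mx, my), (nx, ny), (qx, qy) = p, m, n, q
    det = (nx - mx) * (qy - my) - (qx - mx) * (ny - my)
    b = ((px - mx) * (qy - my) - (qx - mx) * (py - my)) / det
    c = ((nx - mx) * (py - my) - (px - mx) * (ny - my)) / det
    return 1.0 - b - c, b, c


def in_conic_segment(p, m, n, q, w=1.0):
    """True if p lies between the chord MP and the conic arc from M to P."""
    a, b, c = barycentric(p, m, n, q)
    on_arc_side_of_chord = b >= 0.0                    # linear expression
    inside_conic = 4.0 * w * w * a * c - b * b >= 0.0  # quadratic expression
    # a >= 0 and c >= 0 exclude the second branch when the conic is a hyperbola.
    return on_arc_side_of_chord and a >= 0.0 and c >= 0.0 and inside_conic


m, n, q = (0.0, 0.0), (1.0, 2.0), (2.0, 0.0)
below_arc = in_conic_segment((1.0, 0.5), m, n, q)
above_arc = in_conic_segment((1.0, 1.5), m, n, q)
```

Since a, b, and c are linear in x and y, both tests are expressions of the kind the QEE evaluates for every pixel at once.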

Performance estimates have been obtained from a conic representation of a Times Roman font given to us courtesy
of Michael Shantz of Sun Microsystems. The average number of convex polygons per character in this set is 8.12,
the average number of straight edges per polygon is 4.13, and the average number of conics per character is 5.4.
This indicates that the average character can be scan-converted with 36 linear coefficients and 8.4 quadratic
coefficients. This suggests that each renderer can scan-convert over 20,000 letters per second. Assuming each
character falls into an average of 1.4 rendering regions, 16 renderers can draw over 223,000 letters per second.
Graphics Processors will have difficulty keeping up with this rendering rate. GPs can cache renderer commands for
2D applications.

7.6  Fast  Radiosity

The  radiosity  lighting  model  offers  dramatically  improved  realism  for  certain  types  of  images  [Immel  86,  Wallace 
87].  The  progressive  radiosity  approach  [Cohen  88]  would  be  well-suited  for  interactive  applications  if  images  could 
be computed more rapidly. Pxpl5's renderers allow us to greatly accelerate the progressive radiosity method. We
have  developed  an  algorithm  for  computing  projections  of  3D  polygons  onto  a  hemisphere,  which  speeds  the 
projection,  scan  conversion  and  visibility  calculations  necessary  to  distribute  light  from  a  light  source  to  the  patches 
in  the  scene.  Instead  of  storing  color  values  at  the  pixels,  we  store  the  patch  id  of  the  nearest  visible  patch.  Once  all 
the  patches  in  a  scene  have  been  processed,  the  visible  patch  matrix  is  sent  over  the  ring  network  to  a  GP,  which 
scans  through  the  matrix,  updating  the  radiosities  of  patches  indicated  by  the  patch  ids. 




Figure  7.  Hemispherical  projection  of  a  triangle. 

In the unusual scan conversion process, the edges of a polygon map to ellipses in screen-space. The Pxpl5 QEE
computes these ellipses directly to scan convert the polygons' projections into pixel space. Figure 7 illustrates the
scan  conversion  process.  Z-buffering  can  be  performed  using  approximations  or  by  storing  a  special  constant  term 
for  each  pixel  [Goldfeather  89].  Figure  8  shows  the  result  of  the  hemispherical  projection  and  Z-buffered  rendering  of 
a  room  environment. 

This technique can compute these radiosity form-factors in one pass instead of the five passes that would be
necessary in a hemi-cube implementation, and even this one pass could be done at Pxpl5 polygon rendering rates.
Since the resolution within a single renderer appears to be more than adequate for this calculation, multiple renderers
can be used independently. Each renderer should be able to process about 100,000 quadrilaterals per second. If the
GPs cannot keep up, we may calculate the form factors at reduced resolution, reducing the number of patch ids
the GPs need to tally.
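
The tally performed by a GP can be sketched as follows. The sketch assumes the hemisphere is projected orthographically onto its base disk, so that every pixel inside the disk carries equal form-factor weight and a patch's form factor is its pixel count divided by the number of disk pixels. The function name, the square matrix layout, and the `background` marker are illustrative.

```python
from collections import Counter


def form_factors(id_matrix, background=-1):
    """Estimate form factors from a square matrix of visible patch ids."""
    n = len(id_matrix)
    counts = Counter()
    disk_pixels = 0
    for row in range(n):
        for col in range(n):
            # Pixel center mapped into the square [-1, 1] x [-1, 1].
            x = (col + 0.5) * 2.0 / n - 1.0
            y = (row + 0.5) * 2.0 / n - 1.0
            if x * x + y * y > 1.0:
                continue                  # outside the projected hemisphere
            disk_pixels += 1
            patch_id = id_matrix[row][col]
            if patch_id != background:
                counts[patch_id] += 1
    return {patch_id: k / disk_pixels for patch_id, k in counts.items()}


ids = [[7, 7, 7, 7],
       [7, 7, 7, 7],
       [9, 9, 9, 9],
       [9, 9, 9, 9]]
ff = form_factors(ids)
```

Halving the matrix resolution cuts the number of ids to tally by a factor of four, which is the trade-off mentioned above for the case where the GPs cannot keep up.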


Displaying the radiosity image is performed in the conventional manner: vertex colors are computed from patch
radiosities, and linear interpolation is used to blend colors smoothly across each patch.


Figure 8. (a) Hemispherical projection of Tebbs and Turk's office, generated on the Pxpl5 simulator. Estimated
rendering time on Pxpl5 is 2.8 milliseconds. (b) Standard view of the same room as in (a), displayed on Pxpl4
(radiosity software described in [Airey 89]). The viewpoint in (a) is from the illuminated light fixture.


7.7  Volume  Rendering

One example of Pxpl5's generality is its ability to perform volume rendering. Marc Levoy plans to implement a
version of the algorithm described in [Levoy 88a, 88b, 89]. To briefly summarize the algorithm: We begin with a
3D array of scalar-valued voxels. We first classify and shade the array based on the function value and its gradient to
yield a color and an opacity for each voxel. Parallel viewing rays are then traced into the array from an observer
position. Each ray is divided into equally spaced sample intervals, and a color and opacity is computed at the center
of each interval by tri-linearly interpolating from the colors and opacities of the nearest eight voxels. The resampled
colors and opacities are then composited in front-to-back order to yield a color for the ray.
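
The final compositing step can be sketched for a single ray as below. This is the standard front-to-back "over" accumulation, shown with scalar (gray) colors for brevity; the function name and the sample layout are illustrative, and the resampling step is assumed to have been done already.

```python
def composite_front_to_back(samples):
    """Composite (color, opacity) samples along one ray, nearest sample first.

    Returns the accumulated color and accumulated opacity of the ray.
    """
    color = 0.0
    alpha = 0.0
    for sample_color, sample_alpha in samples:
        # Each sample is attenuated by the opacity already accumulated in front.
        color += (1.0 - alpha) * sample_alpha * sample_color
        alpha += (1.0 - alpha) * sample_alpha
    return color, alpha


ray = composite_front_to_back([(1.0, 0.5), (0.0, 1.0)])
```

Here a half-transparent white sample in front of an opaque black one yields a mid-gray, fully opaque ray.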

For Pxpl5, we propose to store the function value and gradient for several voxels in the backing store of each pixel
processor. The processor then performs classification and shading calculations for all voxels in its backing store.
The time to apply a monochrome Phong shading model at a single voxel using a pixel processor is about 1 msec.
For a 256 x 256 x 256 voxel dataset, each pixel processor would be assigned 64 voxels, so the time required to
classify and shade the entire dataset would be about 64 msec.
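
The figure of 64 voxels per pixel processor follows if a nominal configuration of 16 renderers, each with a 128 x 128 array of pixel processors, is assumed:

```latex
\begin{align*}
N_{\text{voxels}} &= 256^3 = 16{,}777{,}216 \\
N_{\text{processors}} &= 16 \times 128 \times 128 = 262{,}144 \\
\frac{N_{\text{voxels}}}{N_{\text{processors}}} &= 64 \\
t_{\text{classify+shade}} &\approx 64 \times 1\ \text{msec} = 64\ \text{msec}
\end{align*}
```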

GPs  perform  the  ray-tracing  to  generate  the  image.  They  are  each  assigned  a  set  of  rays  and  request  sets  of  voxels 
from  the  pixel  processors  as  necessary.  The  GPs  perform  the  tri-linear  interpolation  and  compositing  operations, 
then  transmit  the  resulting  pixel  colors  to  the  frame  buffer  for  display. 

The success of this approach depends on reducing the number of voxels flowing from the pixel processors to the
GPs. Three strategies are planned: First, a hierarchical enumeration of the volumetric dataset [Levoy 88b] will be
installed in each graphics processor. This data structure encodes the coherence present in the dataset, telling the
graphics processor which voxels are interesting (non-transparent) and hence worth requesting from the pixel
processors. Second, the adaptive sampling scheme described in [Levoy 89] will be used to reduce the number of rays
required to generate an initial image. Last, all voxels received by a graphics processor will be retained in a local
cache. If the observer does not move during generation of the initial image, the cached voxels will be used to drive
successive refinement of the image. If the observer moves, many of the voxels required to generate the next frame
may already reside in the cache, depending on how far the observer moves between frames.


Figure  9.  Volume-rendered  head  from  CT  data,  generated  on  a  Sun  4.  Estimated  rendering  time  on  Pxpl5  is  1 
second.  Photo  courtesy  of  Marc  Levoy. 

The  frame  rate  we  expect  from  this  system  depends  on  which  parameters  change  from  frame  to  frame.  Preliminary 
estimates  suggest  that  for  changes  in  observer  position  alone,  we  will  be  able  to  generate  a  sequence  of  slightly 
coarse  images  at  10  frames  per  second  and  a  sequence  of  images  of  the  quality  of  Figure  9  at  1  frame  per  second.  For 
changes in shading or changes in classification that do not invalidate the hierarchical enumeration, we expect to
obtain about 20 coarse or 2 high-quality images per second. This includes highlighting and interactively moving a
region of interest, which we plan to implement by heightening the opacity of voxels inside the region and
attenuating the opacities of voxels outside the region. If the user changes the classification mapping so that the set
of  interesting  voxels  is  altered,  the  hierarchical  enumeration  must  be  recomputed.  We  expect  this  operation  to  take 
several  seconds. 

7.8  Rendering  CSG-defined  Objects 

We and others have developed algorithms to directly render Constructive Solid Geometry (CSG) defined objects on
graphics systems with deep frame buffers [Goldfeather 86a, Jansen 86, Rossignac 86]. On Pxpl4 we developed a
CSG modeler that displays small datasets at interactive rates [Goldfeather 88].

Pxpl5 provides several opportunities to increase rendering speed: the QEE on Pxpl5 renders curved-surfaced
primitives without breaking them into polygonal facets; having more bits per pixel allows surfaces that are used
multiple times to be stored and re-used, rather than being re-rendered, greatly increasing performance; finally, the
screen-subdivision technique advocated in [Jansen 87] provides a way to take advantage of Pxpl5's multiple
renderers. Pxpl4 interactively renders CSG objects with dozens of primitives (Figure 10). We expect Pxpl5 to
interactively render objects with hundreds of primitives.


Figure 10. CSG-modeled truck generated on Pxpl4. Estimated rendering time on Pxpl5 is 40 milliseconds.


7.9  Transparency 

A number of methods for rendering transparent surfaces are possible, given the generality and power of Pxpl5. The
most promising is to enhance the bin sorting in each GP to generate twice as many bins, one for transparent and
another for opaque primitives for each region. The transparent primitives are rendered after all the opaque ones.
Since we expect relatively few transparent polygons, each of the "transparent" bins can be sorted from back to front
and rendered by simple composition. For difficult cases, in which a cluster of transparent polygons cannot be sorted
in  Z  (as  in  a  basket-weave  of  transparent  strips),  multiple  Z  values  can  be  stored  at  each  pixel  to  control  the 
compositing  step.  With  this  approach,  difficult  primitives  may  need  to  be  sent  to  renderers  several  times  to  ensure 
correct  blending. 
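
The "simple composition" of a sorted transparent bin can be sketched at a single pixel as below. The sketch takes a per-pixel view for clarity, whereas the text sorts whole primitives within a bin; scalar colors, the fragment layout, and the convention that larger z is farther from the viewer are assumptions.

```python
def composite_back_to_front(opaque_color, fragments):
    """Blend transparent fragments over the opaque color already at a pixel.

    fragments is a list of (z, color, opacity); larger z means farther away.
    """
    color = opaque_color
    # Farthest fragment first, so nearer fragments are blended on top.
    for _z, frag_color, frag_alpha in sorted(fragments, key=lambda f: f[0],
                                             reverse=True):
        color = frag_alpha * frag_color + (1.0 - frag_alpha) * color
    return color


pixel = composite_back_to_front(0.0, [(1.0, 1.0, 0.5), (2.0, 0.5, 0.5)])
```

Because each blend depends only on the color already stored at the pixel, no extra per-pixel storage is needed as long as the primitives arrive in back-to-front order.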

A  second  method,  stochastically  sampling  the  pixels  of  transparent  primitives,  is  currently  being  used  on  Pxpl4.  It 
is very simple and efficient, but requires several anti-aliasing passes to produce an acceptable image (without
anti-aliasing, primitives appear splotchy). With Pxpl5's increased speed this method may perform so well that more
complicated  algorithms  are  not  needed. 


8.  Conclusions 

It  is  too  early  to  conclude  much  about  the  potential  usefulness  of  Pixel-planes  5.  We  hope  that  with  its  generality 
and  simple  conceptual  structure,  it  will  prove  useful  for  our  local  colleagues'  activities  in  highly  interactive  3D 
graphics. We are convinced that even experimental machines like this one should be built for a community of users
who can dispassionately evaluate their utility. The heavy local use of its predecessor, Pxpl4, has contributed
substantially to the ideas in Pxpl5.

With the rapid development of high-performance graphics engines in the past few years, it is difficult to determine
which of the many approaches will continue to be useful in the future. Among the safest predictions is that
scan-line ordered pipeline processing will continue to be a cost-effective solution to rendering specific sets of
primitives and that parallel screen subdivision will continue to be useful for general purpose image generation.
Within this latter approach, we speculate that there may be a convergence between current parallel solutions
(4x4-pixel footprint of the Stellar GS-1000, the 4x5 footprint of SGI, the 8x8 footprint of Pixel Machine) and the
128x128-pixel footprint of Pxpl5. Once the size of the footprint becomes large enough so most primitives fall into
only a single region, the rendering can be done independently for each region with little penalty for duplication of
primitives among the multiple regions. With VLSI and ULSI technology, it will be increasingly practical to have
such footprints that are sufficiently large to simplify the processing in this way.

Current status of Pxpl5. As of January 10, hardware and software are being built. Of the three custom CMOS
VLSI chips being designed, the processor-enhanced memory and the backing memory interface are both in final
simulations, projected to be sent to fabrication in the next few weeks. The third chip, the renderer controller, is in
the middle of layout. Detailed simulation of the board-level logic design is well along, and PCBs are being designed.
A small version of the Ring Network with a pair of Graphics Processors is expected to become operational in March,
with a small complete system running in July. On the software front, a high-level language porting base is running
simple code. The renderer simulator is yielding useful images.